Abstract



We analyze the expected cost of a greedy active learning algorithm. Our analysis extends previous work to a more 
general setting in which different queries have different costs. Moreover, queries may have more than two possible 
responses and the distribution over hypotheses may be non uniform. Specific applications include active learning with 
label costs, active learning for multiclass and partial label queries, and batch mode active learning. We also discuss 
an approximate version of interest when there are very many queries. 



We first motivate the problem by describing it informally. Imagine two people are playing a variation of twenty 
questions. Player 1 selects an object from a finite set, and it is up to player 2 to identify the selected object by asking 
questions chosen from a finite set. We assume for every object and every question the answer is unambiguous: each 
question maps each object to a single answer. Furthermore, each question has associated with it a cost, and the goal of 
player 2 is to identify the selected object using a sequence of questions with minimal cost. There is no restriction that 
the questions are yes or no questions. Presumably, complicated, more specific questions have greater costs. It doesn't 
violate the rules to include a single question enumerating all the objects (Is the object a dog or a cat or an apple or...), 
but for the game to be interesting it should be possible to identify the object using a sequence of less costly questions. 

With player 1 the human expert and player 2 the learning algorithm, we can think of active learning as a game of 
twenty questions. The set of objects is the hypothesis class, the selected object is the optimal hypothesis with respect 
to a training set, and the questions available to player 2 are label queries for data points in the finite sized training set. 
Assuming the data set is separable, label queries are unambiguous questions (i.e. each question has an unambiguous 
answer). By restricting the hypothesis class to be a set of possible labelings of the training set (i.e. the effective
hypothesis class for some other possibly infinite hypothesis class), we can also ensure there is a unique zero-error
hypothesis. If we set all question costs to 1, we recover the traditional active learning problem of identifying the target
hypothesis using a minimal number of labels. 

However, this framework is also general enough to cover a variety of active learning scenarios outside of traditional 
binary classification. 

• Active learning with label costs If different data points are more or less costly to label, we can model these 
differences using non uniform label costs. For example, if a longer document takes longer to label than a 
shorter document, we can make costs proportional to document length. The goal is then to identify the optimal 
hypothesis as quickly as possible as opposed to using as few labels as possible. This notion of label cost is 
different than the often studied notion of misclassification cost. Label cost refers to the cost of acquiring a label 
at training time where misclassification cost refers to the cost of incorrectly predicting a label at test time. 

• Active learning for multiclass and partial label queries We can directly ask for the label of a point (Is the
label of this point "a", "b", or "c"?), or we can ask less specific questions about the label (Is the label of this point
"a" or some other label?). We can also mix these question types, presumably making less specific questions less
costly. These kinds of partial label queries are particularly important when examples have structured labels. In
a parsing problem, a partial label query could ask for the portion of a parse tree corresponding to a small phrase
in a long sentence.

• Batch mode active learning Questions can also be queries for multiple labels. In the extreme case, there can be
a question corresponding to every subset of possible single data point questions. Batch label queries only help
the algorithm reduce total label cost if the cost of querying for a batch of labels is in some cases less than the
sum of the corresponding individual label costs. This is the case if there is a constant additive cost overhead
associated with asking a question or if we want to minimize time spent labeling and there are multiple labelers
who can label examples in parallel.

Figure 1: Decision tree view of active learning. Internal nodes are questions (label queries), branches are answers
(label values), and leaves are target objects (hypotheses). The cost of identifying a target object is the sum of the
question costs along the path from the root to that object.

Beyond these specific examples, this setting applies to any active learning problem for which different user interactions 
have different costs and are unambiguous as we have defined. For example, we can ask questions concerning the 
percentage of positive and negative examples according to the optimal classifier (Does the optimal classifier label 
more than half of the data set positive?). This abstract setting also has applications outside of machine learning. 

• Information Retrieval We can think of a question asking strategy as an index into the set of objects which can 
then be used for search. If we make the cost of a question the expected computational cost of computing the 
answer for a given object, then a question asking strategy with low cost corresponds to an index with fast search 
time. For example, if objects correspond to points in $\mathbb{R}^n$ and questions correspond to axis aligned hyperplanes,
a question asking strategy is a kd-tree.

• Compression A question asking strategy produces a unique sequence of responses for each object. If we make 
the cost of a question the log of the number of possible responses to that question, then a question asking strategy 
with low cost corresponds to a code book for the set of objects with small code length (jst]. 

Interpreted in this way, active learning, information retrieval, and compression can be thought of as variations of the 
same problem in which we minimize interaction cost, computation cost, and code length respectively. 

In this work we consider this general problem for average-case cost. The object is selected at random and the goal 
is to minimize the expected cost of identifying the selected object. The distribution from which the object is drawn is 
known but may not be uniform. Previous work ifTll [H S Bl has shown simple greedy algorithms are approximately 
optimal in certain more restrictive settings. We extend these results to our more general setting. 

2 Preliminaries 

We first review the main result of Dasgupta [6] which our first bound extends. We assume we have a finite set of
objects (for example hypotheses) $H$ with $|H| = n$. A randomly chosen $h^* \in H$ is our target object with a known
positive $\pi(h)$ defining the distribution over $H$ by which $h^*$ is drawn. We assume $\min_h \pi(h) > 0$ and $|H| > 1$. We also
assume there is a finite set of questions $q_1, q_2, \ldots, q_m$ each of which has a positive cost $c_1, c_2, \ldots, c_m$. Each question $q_i$
maps each object to a response from a finite set of answers $A = \bigcup_i \bigcup_{h \in H} \{q_i(h)\}$ and asking $q_i$ reveals $q_i(h^*)$, eliminating
from consideration all objects $h$ for which $q_i(h) \neq q_i(h^*)$. An active learning algorithm continues asking questions
until $h^*$ has been identified (i.e. we have eliminated all but one of the elements from $H$). We assume this is possible
for any element in $H$. The goal of the learning algorithm is to identify $h^*$ with questions incurring as little cost as
possible. Our result bounds the expected cost of identifying $h^*$.

Algorithm 1 Cost Sensitive Greedy Algorithm

1: $S \Leftarrow H$
2: repeat
3:   $i = \arg\max_i \Delta_i(S, \pi_S)/c_i$
4:   $S \Leftarrow \{s \in S : q_i(s) = q_i(h^*)\}$
5: until $|S| = 1$

We assume that the distribution $\pi$, the hypothesis class $H$, the questions $q_i$, and the costs $c_i$ are known. Any
deterministic question asking strategy (e.g. a deterministic active learning algorithm taking in this known information)
produces a decision tree in which internal nodes are questions and the leaves are elements of $H$. The cost of a query
tree $T$ with respect to a distribution $\pi$, $C(T, \pi)$, is defined to be the expected cost of identifying $h^*$ when $h^*$ is chosen
according to $\pi$. We can write $C(T, \pi)$ as $C(T, \pi) = \sum_{h \in H} \pi(h) c_T(h)$ where $c_T(h)$ is the cost to identify $h$ as the
target object. $c_T(h)$ is simply the sum of the costs of the questions along the path from the root of $T$ to $h$. We define
$\pi_S$ to be $\pi$ restricted and normalized w.r.t. $S$. For $s \in S$, $\pi_S(s) = \pi(s)/\pi(S)$, and for $s \notin S$, $\pi_S(s) = 0$. Tree cost
decomposes nicely.
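The definition of $C(T, \pi)$ can be made concrete with a small sketch. The `Leaf` and `Node` classes, the three-object tree, and the costs below are hypothetical and chosen only for illustration; the computation is the sum $\sum_h \pi(h) c_T(h)$ from the text.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    obj: str  # the object identified at this leaf


@dataclass
class Node:
    cost: float                     # cost of the question asked at this node
    children: Dict[object, "Tree"]  # answer -> subtree


Tree = Union[Leaf, Node]


def leaf_costs(tree, so_far=0.0):
    """Map each leaf object h to c_T(h), the summed question cost on its root path."""
    if isinstance(tree, Leaf):
        return {tree.obj: so_far}
    out = {}
    for child in tree.children.values():
        out.update(leaf_costs(child, so_far + tree.cost))
    return out


def tree_cost(tree, pi):
    """C(T, pi) = sum over h of pi(h) * c_T(h)."""
    c = leaf_costs(tree)
    return sum(pi[h] * c[h] for h in pi)


# Hypothetical example: a cost-1 question at the root, a cost-2 question below it.
tree = Node(1.0, {0: Leaf("a"), 1: Node(2.0, {0: Leaf("b"), 1: Leaf("c")})})
pi = {"a": 0.5, "b": 0.25, "c": 0.25}
print(tree_cost(tree, pi))  # 0.5*1 + 0.25*3 + 0.25*3 = 2.0
```

Object "a" is identified after one question of cost 1, while "b" and "c" each require both questions for a path cost of 3.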

Lemma 1. For any tree $T$ and any $S = \bigcup_i S^i$ with $\forall i \neq j$ $S^i \cap S^j = \emptyset$, $S^i \neq \emptyset$

$$C(T, \pi_S) = \sum_i \pi_S(S^i) C(T, \pi_{S^i})$$



We define the version space to be the subset of $H$ consistent with the answers we have received so far. Questions
eliminate elements from the version space. For a question $q_i$ and a particular version space $S \subseteq H$, we define
$S^j = \{s \in S : q_i(s) = j\}$. With this notation the dependence on $q_i$ is suppressed but understood by context. As
shorthand, for a distribution $\pi$ we define $\pi(S) = \sum_{s \in S} \pi(s)$. On average, asking question $q_i$ shrinks the absolute
mass of $S$ with respect to a distribution $\pi$ by

$$\Delta_i(S, \pi) = \sum_{j \in A} \frac{\pi(S^j)}{\pi(S)} \bigl(\pi(S) - \pi(S^j)\bigr) = \pi(S) - \sum_{j \in A} \frac{\pi(S^j)^2}{\pi(S)}$$

We call this quantity the shrinkage of $q_i$ with respect to $(S, \pi)$. We note $\Delta_i(S, \pi)$ is only defined if $\pi(S) > 0$. If $q_i$
has cost $c_i$, we call $\Delta_i(S, \pi)/c_i$ the shrinkage-cost ratio of $q_i$ with respect to $(S, \pi)$.
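As a minimal sketch of the shrinkage, assuming the form $\pi(S) - \sum_j \pi(S^j)^2/\pi(S)$ (the form used later in the proof of Lemma 3), the function below groups the version space by answer and applies the formula. The four-object distribution and the three example questions are hypothetical.

```python
from collections import defaultdict


def shrinkage(S, pi, q):
    """Delta(S, pi) = pi(S) - sum_j pi(S^j)^2 / pi(S), with S^j = {s in S : q(s) = j}."""
    mass = defaultdict(float)
    for s in S:
        mass[q(s)] += pi[s]
    total = sum(mass.values())
    if total <= 0:
        raise ValueError("shrinkage is only defined when pi(S) > 0")
    return total - sum(m * m for m in mass.values()) / total


# Hypothetical example: four equally likely objects.
pi = {h: 0.25 for h in range(4)}
S = set(pi)
even_split = lambda h: h % 2     # splits S into two halves
one_vs_rest = lambda h: h == 0   # isolates a single object
full_label = lambda h: h         # distinct answer for every object

print(shrinkage(S, pi, even_split))   # 0.5
print(shrinkage(S, pi, one_vs_rest))  # 0.375
print(shrinkage(S, pi, full_label))   # 0.75
```

Dividing each value by the cost of the corresponding question gives the shrinkage-cost ratio. A question that reveals everything has the largest shrinkage but need not have the largest ratio once it is made expensive.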

In previous work IS IH Etlj the greedy algorithm analyzed is the algorithm that at each step chooses the question 
$q_i$ that maximizes the shrinkage with respect to the current version space $\Delta_i(S, \pi_S)$. In our generalized setting, we
define the cost sensitive greedy algorithm to be the active learning algorithm which at each step asks the question with
the largest shrinkage-cost ratio $\Delta_i(S, \pi_S)/c_i$ where $S$ is the current version space. We call the tree generated by this
method the greedy query tree. See Algorithm 1. Adler and Heeringa [1] also analyzed a cost-sensitive method for the
restricted case of questions with two responses and uniform $\pi$, and our method is equivalent to theirs in this case. The
main result of Dasgupta [6] is that, on average, with unit costs and yes/no questions, the greedy strategy is not much
worse than any other strategy. We repeat this result here.

Theorem 1. Theorem 3 [6] If $|A| = 2$ and $\forall i$ $c_i = 1$, then for any $\pi$ the greedy query tree $T^g$ has cost at most

$$C(T^g, \pi) \le 4 C^* \ln \frac{1}{\min_h \pi(h)}$$

where $C^* = \min_T C(T, \pi)$.
For a uniform $\pi$, the log term becomes $\ln n$ so the approximation factor grows with the log of the number of
objects. In the non uniform case, the greedy algorithm can do significantly worse. However, Kosaraju et al. [11] and
Chakaravarthy et al. [3] show a simple rounding method can be used to remove dependence on $\pi$. We first give an
extension of Theorem 1 to our more general setting. We then show how to remove dependence on $\pi$ using a similar
rounding method. Interestingly, in our setting this rounding method introduces a dependence on the costs, so neither
bound is strictly better although together they generalize all previous results.
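The cost sensitive greedy algorithm can be sketched in a few lines. This is an illustrative simulation against a known target, not the authors' implementation: the function names, the eight-object instance, and the question costs are assumptions made for the example, and ties are broken by taking the first maximizer.

```python
from collections import defaultdict


def shrinkage(S, pi, q):
    """Shrinkage pi(S) - sum_j pi(S^j)^2 / pi(S) of question q on version space S."""
    mass = defaultdict(float)
    for s in S:
        mass[q(s)] += pi[s]
    total = sum(mass.values())
    return total - sum(m * m for m in mass.values()) / total


def greedy_identify(H, pi, questions, costs, target):
    """Ask greedily chosen questions until only the target remains.

    Returns the indices of the questions asked and their total cost.
    """
    S = set(H)
    asked, total = [], 0.0
    while len(S) > 1:
        mass_S = sum(pi[s] for s in S)
        pi_S = {s: pi[s] / mass_S for s in S}  # pi restricted and normalized to S
        ratios = [shrinkage(S, pi_S, q) / c for q, c in zip(questions, costs)]
        i = max(range(len(questions)), key=ratios.__getitem__)
        if ratios[i] <= 0:
            raise ValueError("no question separates the remaining objects")
        answer = questions[i](target)
        S = {s for s in S if questions[i](s) == answer}
        asked.append(i)
        total += costs[i]
    return asked, total


# Hypothetical instance: 8 equally likely objects, three unit-cost bit questions,
# and one question that reveals the object outright.
H = range(8)
pi = {h: 1 / 8 for h in H}
bits = [lambda h, b=b: (h >> b) & 1 for b in range(3)]
identity = lambda h: h

print(greedy_identify(H, pi, bits + [identity], [1, 1, 1, 4.0], target=5))
# ([0, 1, 2], 3.0)
print(greedy_identify(H, pi, bits + [identity], [1, 1, 1, 1.5], target=5))
# ([3], 1.5)
```

When the revealing question costs 4, its ratio (0.875/4) is below that of a bit question (0.5/1), so the sketch asks the three bit questions. At cost 1.5 its ratio exceeds 0.5 and it is asked first.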



3 Cost Independent Bound 

Theorem 2. For any $\pi$ the greedy query tree $T^g$ has cost at most

$$C(T^g, \pi) \le 12 C^* \ln \frac{1}{\min_{h \in H} \pi(h)}$$

where $C^* = \min_T C(T, \pi)$.
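The bound can be checked numerically on a tiny instance. The sketch below is only a sanity check under stated assumptions: the five-object distribution, the four questions (each written as a tuple of its answers), and the costs are hypothetical, and the optimal cost is found by exhaustive recursion over version spaces, which is feasible only for very small $n$.

```python
import math
from functools import lru_cache

# Hypothetical instance: question k answers questions[k][h] on object h.
H = tuple(range(5))
pi = (0.4, 0.3, 0.15, 0.1, 0.05)
questions = ((0, 0, 1, 1, 2), (0, 1, 0, 1, 0), (0, 1, 2, 3, 4), (0, 0, 0, 0, 1))
costs = (1.0, 1.0, 3.0, 0.5)


def mass(S):
    return sum(pi[s] for s in S)


def split(S, q):
    """Partition version space S by the answer each object gives to q."""
    parts = {}
    for s in S:
        parts.setdefault(q[s], []).append(s)
    return [frozenset(p) for p in parts.values()]


def ratio(S, k):
    """Shrinkage of question k with respect to (S, pi_S), divided by its cost."""
    m = mass(S)
    return (1 - sum((mass(P) / m) ** 2 for P in split(S, questions[k]))) / costs[k]


@lru_cache(maxsize=None)
def greedy(S):
    """pi(S) * C(T^g_S, pi_S) for the greedy subtree rooted at version space S."""
    if len(S) <= 1:
        return 0.0
    k = max(range(len(questions)), key=lambda j: ratio(S, j))
    return mass(S) * costs[k] + sum(greedy(P) for P in split(S, questions[k]))


@lru_cache(maxsize=None)
def optimal(S):
    """pi(S) * min_T C(T, pi_S), by trying every question that actually splits S."""
    if len(S) <= 1:
        return 0.0
    best = math.inf
    for k, q in enumerate(questions):
        parts = split(S, q)
        if len(parts) > 1:
            best = min(best, mass(S) * costs[k] + sum(optimal(P) for P in parts))
    return best


root = frozenset(H)
g, opt = greedy(root), optimal(root)
bound = 12 * opt * math.log(1 / min(pi))
print(g, opt, bound)
assert opt - 1e-9 <= g <= bound
```

On this instance the greedy tree is slightly more expensive than the optimal one, and both sit far below the bound, which is a worst-case guarantee rather than a typical-case estimate.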



What is perhaps surprising about this bound is that the quality of approximation does not depend on the costs 
themselves. The proof follows part of the strategy used by Dasgupta [6]. The general approach is to show that if the
average cost of some question tree is low, then there must be at least one question with high shrinkage-cost ratio. We 
then use this to form the basis of an inductive argument. However, this simple argument fails when only a few objects 
have high probability mass. 

We start by showing the shrinkage of qi monotonically decreases as we eliminate elements from S. 

Lemma 2. Extension of Lemma 6 [6] to non binary queries. If $T \subseteq S \subseteq H$, and $T \neq \emptyset$ then, $\forall i, \pi$, $\Delta_i(T, \pi) \le \Delta_i(S, \pi)$.

Proof. For $|S| = 1$ the result is immediate since $|T| \ge 1$ and therefore $S = T$. We show that if $|S| \ge 2$, removing
any single element $a \in S \setminus T$ from $S$ does not increase $\Delta_i(S, \pi)$. The lemma then follows since we can remove all of
$S \setminus T$ from $S$ an element at a time. Assume w.l.o.g. $a \in S^k$ for some $k$. Here let $A' = A \setminus \{k\}$

$$\Delta_i(S - \{a\}, \pi) = \frac{(\pi(S^k) - \pi(a))(\pi(S) - \pi(S^k))}{\pi(S) - \pi(a)} + \sum_{j \in A'} \frac{\pi(S^j)(\pi(S) - \pi(S^j) - \pi(a))}{\pi(S) - \pi(a)}$$

We show that this is term by term less than or equal to

$$\Delta_i(S, \pi) = \frac{\pi(S^k)(\pi(S) - \pi(S^k))}{\pi(S)} + \sum_{j \in A'} \frac{\pi(S^j)(\pi(S) - \pi(S^j))}{\pi(S)}$$

For the first term

$$\frac{(\pi(S^k) - \pi(a))(\pi(S) - \pi(S^k))}{\pi(S) - \pi(a)} \le \frac{\pi(S^k)(\pi(S) - \pi(S^k))}{\pi(S)}$$

because $\pi(S) \ge \pi(S^k)$ and $\pi(a) > 0$. For any other term in the summation,

$$\frac{\pi(S^j)(\pi(S) - \pi(S^j) - \pi(a))}{\pi(S) - \pi(a)} \le \frac{\pi(S^j)(\pi(S) - \pi(S^j))}{\pi(S)}$$

because $\pi(S) - \pi(S^j) \ge \pi(a) > 0$ and $\pi(S) > \pi(a)$. □

Obviously, the same result holds when we consider shrinkage-cost ratios. 

Corollary 1. If $T \subseteq S \subseteq H$, and $T \neq \emptyset$ then for any $i$, $\pi$, $\Delta_i(T, \pi)/c_i \le \Delta_i(S, \pi)/c_i$.

We define the collision probability of a distribution $v$ over $Z$ to be $CP(v) = \sum_{z \in Z} v(z)^2$. This is exactly the
probability two samples from v will be the same and quantifies the extent to which mass is concentrated on only a few 
points (similar to inverse entropy). If no question has a large shrinkage-cost ratio and the collision probability is low, 
then the expected cost of any query tree must be high. 
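A minimal sketch of the collision probability $CP(v) = \sum_z v(z)^2$ follows. The two example distributions are hypothetical; the skewed one reuses the masses 11/20 and 1/20 that appear in the counter example of Figure 2.

```python
def collision_probability(v):
    """CP(v) = sum_z v(z)^2 for a distribution given as a dict of probabilities."""
    return sum(p * p for p in v.values())


uniform = {z: 1 / 8 for z in range(8)}
skewed = {"h0": 11 / 20, **{f"h{i}": 1 / 20 for i in range(1, 10)}}

print(collision_probability(uniform))  # 0.125, i.e. 1/n for a uniform distribution
print(collision_probability(skewed))   # approximately 0.325
```

Since $CP(v) \le \max_z v(z)$, a collision probability above 1/2 forces a single point to carry more than half the mass. The skewed example shows that the converse can fail: one point has mass 11/20 while the collision probability stays below 1/2.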
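Lemma 3. Extension of Lemma 7 [6] to non binary queries and non uniform costs. For any set $S$ and distribution $v$
over $S$, if $\forall i$ $\Delta_i(S, v)/c_i < \Delta/c$, then for any $R \subseteq S$ with $R \neq \emptyset$ and any query tree $T$ whose leaves include $R$

$$C(T, v_R) \ge \frac{c}{\Delta} v(R)(1 - CP(v_R))$$

Proof. We prove the lemma with induction on $|R|$. For $|R| = 1$, $CP(v_R) = 1$ and the right hand side of the inequality
is zero. For $|R| > 1$, we lower bound the cost of any query tree on $R$. At its root, any query tree chooses some $q_i$ with
cost $c_i$ that divides the version space into $R^j$ for $j \in A$. Using the inductive hypothesis we can then write the cost of
a tree as

$$\begin{aligned}
C(T, v_R) &\ge c_i + \sum_{j \in A} v_R(R^j) \frac{c}{\Delta} v(R^j)\bigl(1 - CP(v_{R^j})\bigr) \\
&= c_i + \frac{c}{\Delta} v(R) \sum_{j \in A} \bigl(v_R(R^j)^2 - v_R(R^j)^2 CP(v_{R^j})\bigr) \\
&= c_i + \frac{c}{\Delta} v(R) \Bigl(1 - 1 + \sum_{j \in A} v_R(R^j)^2 - CP(v_R)\Bigr)
\end{aligned}$$

Here we used

$$\sum_{j \in A} v_R(R^j)^2 CP(v_{R^j}) = \sum_{j \in A} v_R(R^j)^2 \sum_{r \in R^j} v_{R^j}(r)^2 = \sum_{r \in R} v_R(r)^2 = CP(v_R)$$

We now note $v(R)\bigl(1 - \sum_{j \in A} v_R(R^j)^2\bigr) = v(R) - \sum_{j \in A} v(R^j)^2/v(R) = \Delta_i(R, v)$

$$\begin{aligned}
C(T, v_R) &\ge c_i + \frac{c}{\Delta} v(R)(1 - CP(v_R)) - \Delta_i(R, v) \frac{c}{\Delta} \\
&= \frac{c}{\Delta} v(R)(1 - CP(v_R)) + \frac{\Delta c_i - \Delta_i(R, v) c}{\Delta}
\end{aligned}$$

Using Corollary 1, $\Delta_i(R, v)/c_i \le \Delta_i(S, v)/c_i < \Delta/c$, so $\Delta c_i - \Delta_i(R, v) c > 0$ and therefore

$$C(T, v_R) \ge \frac{c}{\Delta} v(R)(1 - CP(v_R))$$

which completes the induction. □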

This lower bound on the cost of a tree translates into a lower bound on the shrinkage-cost ratio of the question 
chosen by the greedy tree. 

Corollary 2. Extension of Corollary 8 [6] to non binary queries and non uniform costs. For any $S \subseteq H$ with $S \neq \emptyset$
and query tree $T$ whose leaves contain $S$, there must be a question $q_i$ with $\Delta_i(S, \pi_S)/c_i \ge (1 - CP(\pi_S))/C(T, \pi_S)$

Proof. Suppose this is not the case. Then there is some $\Delta/c < (1 - CP(\pi_S))/C(T, \pi_S)$ such that $\forall i$ $\Delta_i(S, \pi_S)/c_i <
\Delta/c$. By Lemma 3 (with $v = \pi_S$, $R = S$),

$$C(T, \pi_S) \ge \pi_S(S) \frac{c}{\Delta} (1 - CP(\pi_S)) > \pi_S(S) C(T, \pi_S) = C(T, \pi_S)$$

which is a contradiction. □

A special case which poses some difficulty for the main proof is when for some $S \subseteq H$ we have $CP(\pi_S) > 1/2$.
First note that if $CP(\pi_S) > 1/2$ one object $h_0$ has more than half the mass of $S$. In the lemma below, we use
$R = S \setminus \{h_0\}$. Also let $\delta_i$ be the relative mass of the hypotheses in $R$ that are distinct from $h_0$ w.r.t. question $q_i$,

$$\delta_i = \pi_R(\{r \in R : q_i(h_0) \neq q_i(r)\})$$

In other words, when question $q_i$ is asked, $R$ is divided into a set of hypotheses
that agree with $h_0$ (these have relative mass $1 - \delta_i$) and a set of hypotheses that disagree with $h_0$ (these have relative
mass $\delta_i$). Dasgupta [6] also treats this as a special case. However, in the more general setting treated here the situation
is more subtle. For yes or no questions, the question chosen by the greedy query tree is also the question that removes
the most mass from $R$. In our setting this is not necessarily the case. The left of Figure 2 shows a counter example.
However, we can show the fraction of mass removed from $R$ by the greedy query tree is at least half the fraction
removed by any other question. Furthermore, to handle costs, we must instead consider the fraction of mass removed
from $R$ per unit cost.

[Figure 2 panels. Left: $\pi(h_0) = 11/20$ and $\pi(h_i) = 1/20$ for $i = 1 \ldots 9$. Question $q_1$ places $h_0$ alone in one set and
$h_1, h_2, \ldots, h_9$ together in another; its shrinkage is .495 and all of $R$ is separated from $h_0$ by $q_1$. Question $q_2$ keeps
$h_1$ with $h_0$ and places each $h_i$ for $i = 2 \ldots 9$ in its own set; its shrinkage is .62 and 8/9 of $R$ is separated from $h_0$ by $q_2$.
Right: the path to $h_0$ in the greedy tree, annotated with the accumulated costs $C_d$ and the sets $R_d$ and $S_d^j$.]

Figure 2: Left: Counter example showing that when a single hypothesis $h_0$ contains more than half the mass, the query
with maximum shrinkage is not necessarily the query that separates the most mass from $h_0$. Right: Notation for this
case.
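The numbers in the counter example of Figure 2 can be reproduced directly. The sketch below uses exact fractions; the encodings of $q_1$ and $q_2$ as Python functions and the answer labels they return are assumptions made for illustration, while the masses come from the figure.

```python
from fractions import Fraction as F

pi = {"h0": F(11, 20), **{f"h{i}": F(1, 20) for i in range(1, 10)}}
R = [h for h in pi if h != "h0"]

q1 = lambda h: 0 if h == "h0" else 1          # separates h0 from all of R
q2 = lambda h: 0 if h in ("h0", "h1") else h  # keeps h1 with h0, isolates h2..h9


def shrinkage(q):
    """Shrinkage with respect to (S, pi) where S is all ten objects, so pi(S) = 1."""
    masses = {}
    for h, p in pi.items():
        masses[q(h)] = masses.get(q(h), 0) + p
    return 1 - sum(m * m for m in masses.values())


def delta(q):
    """Relative mass of R that disagrees with h0 on question q."""
    mass_R = sum(pi[r] for r in R)
    return sum(pi[r] for r in R if q(r) != q("h0")) / mass_R


for name, q in (("q1", q1), ("q2", q2)):
    print(name, float(shrinkage(q)), delta(q))
# q1 0.495 1
# q2 0.62 8/9
```

So $q_2$ has the larger shrinkage even though $q_1$ separates more of $R$ from $h_0$, while the fraction removed by $q_2$ (8/9) is still more than half of the fraction removed by $q_1$.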

In this lemma we use $\pi_{\{h_0\}}$ to denote the distribution which puts all mass on $h_0$. The cost of identifying $h_0$ in a
tree $T^*$ is then $C^*(h_0) = C(T^*, \pi_{\{h_0\}})$.

Lemma 4. Consider any $S \subseteq H$ and $\pi$ with $CP(\pi_S) > 1/2$ and $\pi(h_0) > 1/2$. Let $C^*(h_0) = C(T^*, \pi_{\{h_0\}})$ for any
$T^*$ whose leaves contain $S$. Some question $q_i$ has $\delta_i/c_i \ge 1/C^*(h_0)$.

Proof. There is always a set of questions indexed by the set $I$ with total cost $\sum_{i \in I} c_i \le C^*(h_0)$ that distinguish $h_0$
from $R$ within $S$. In particular, the set of questions used to identify $h_0$ in $T^*$ satisfy this. Since the set identifies $h_0$,
$\sum_{i \in I} \delta_i \ge 1$ which implies

$$\sum_{i \in I} \frac{c_i}{C^*(h_0)} \cdot \frac{\delta_i}{c_i} \ge \frac{1}{C^*(h_0)}$$

Because $c_i/C^*(h_0) \in (0, 1]$ and $\sum_{i \in I} c_i/C^*(h_0) \le 1$, there must be a $q_i$ such that $\delta_i/c_i \ge 1/C^*(h_0)$. □

Having shown that some query always reduces the relative mass of $R$ by $1/C^*(h_0)$ per unit cost, we now show
that the greedy query tree reduces the mass of $R$ by at least half as much per unit cost.

Lemma 5. Consider any $\pi$ and $S \subseteq H$ with $CP(\pi_S) > 1/2$, $\pi(h_0) > 1/2$, and a corresponding subtree $T^g_S$ in the
greedy tree. Let $C^*(h_0) = C(T^*, \pi_{\{h_0\}})$ for any $T^*$ whose leaves contain $S$. The question $q_i$ chosen by $T^g_S$ has
$\delta_i/c_i \ge 1/(2C^*(h_0))$.

Proof. We prove this by showing that the fraction removed from $R$ per unit cost by the greedy query tree's question is
at least half that of any other question. Combining this with Lemma 4 we get the desired result.

We can write the shrinkage of $q_i$ in terms of $\delta_i$. Here let $A' = A \setminus \{q_i(h_0)\}$. Since $\pi(S^{q_i(h_0)}) = \pi(h_0) +
(\pi(R) - \delta_i \pi(R))$, and $\pi(S) - \pi(S^{q_i(h_0)}) = \delta_i \pi(R)$, we have that

$$\Delta_i(S, \pi_S) = \bigl(\pi_S(h_0) + (1 - \delta_i)\pi_S(R)\bigr)\delta_i \pi_S(R) + \sum_{j \in A'} \pi_S(S^j)\bigl(\pi_S(S) - \pi_S(S^j)\bigr)$$

We use $\sum_{j \in A'} \pi_S(S^j) = \delta_i \pi_S(R)$.

We can then upper bound the shrinkage using $\pi_S(S) - \pi_S(S^j) \le 1$

$$\Delta_i(S, \pi_S) \le \bigl(\pi_S(h_0) + (1 - \delta_i)\pi_S(R)\bigr)\delta_i \pi_S(R) + \delta_i \pi_S(R) \le 2\delta_i \pi_S(R)$$

and lower bound the shrinkage using $\pi_S(h_0) \ge 1/2$ and $\pi_S(S) - \pi_S(S^j) \ge \pi_S(h_0) + (1 - \delta_i)\pi_S(R)$ for any $j \in A'$

$$\Delta_i(S, \pi_S) \ge 2\bigl(\pi_S(h_0) + (1 - \delta_i)\pi_S(R)\bigr)\delta_i \pi_S(R) \ge \delta_i \pi_S(R)$$

Let $q_i$ be any question and $q_j$ be the question chosen by the greedy tree giving $\Delta_j(S, \pi_S)/c_j \ge \Delta_i(S, \pi_S)/c_i$.
Using the upper and lower bounds we derived, we then know $2\delta_j \pi_S(R)/c_j \ge \delta_i \pi_S(R)/c_i$ and can conclude $2\delta_j/c_j \ge
\delta_i/c_i$. Combining this with Lemma 4, $\delta_j/c_j \ge 1/(2C^*(h_0))$. □



The main theorem immediately follows from the next theorem.
Theorem 3. If $T^*$ is any query tree for $\pi$ and $T^g$ is the greedy query tree for $\pi$, then for any $S \subseteq H$ corresponding to
the subtree $T^g_S$ of $T^g$,

$$C(T^g_S, \pi_S) \le 12 C(T^*, \pi_S) \ln \frac{\pi(S)}{\min_{h \in S} \pi(h)}$$



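Proof. In this proof we use $C^*(S)$ as a short hand for $C(T^*, \pi_S)$. Also, we use $\min(S)$ for $\min_{s \in S} \pi(s)$. We proceed
with induction on $|S|$. For $|S| = 1$, $C(T^g_S, \pi_S)$ is zero and the claim holds. For $|S| > 1$, we consider two cases.
Case one: $CP(\pi_S) \le 1/2$

At the root of $T^g_S$, the greedy query tree chooses some $q_i$ with cost $c_i$ that reduces the version space to $S^j$ when
$q_i(h^*) = j$. Let $\pi(S^+) = \max\{\pi(S^j) : j \in A\}$. Using the inductive hypothesis

$$\begin{aligned}
C(T^g_S, \pi_S) &= c_i + \sum_{j \in A} \pi_S(S^j) C(T^g_{S^j}, \pi_{S^j}) \\
&\le c_i + \sum_{j \in A} 12 \pi_S(S^j) C^*(S^j) \ln \frac{\pi(S^j)}{\min(S^j)} \\
&\le c_i + 12 \Bigl(\sum_{j \in A} \pi_S(S^j) C^*(S^j)\Bigr) \ln \frac{\pi(S^+)}{\min(S)}
\end{aligned}$$

Now using Lemma 1, $\pi(S^+) = \pi(S)\pi_S(S^+)$, and then $\ln(1 - x) \le -x$

$$\begin{aligned}
C(T^g_S, \pi_S) &\le c_i + 12 C^*(S) \ln \frac{\pi(S)}{\min(S)} + 12 C^*(S) \ln \pi_S(S^+) \\
&\le c_i + 12 C^*(S) \ln \frac{\pi(S)}{\min(S)} - 12 C^*(S)\bigl(1 - \pi_S(S^+)\bigr)
\end{aligned}$$

$\pi_S(S^+) \ge \sum_{j \in A} \pi_S(S^j)^2$ because this sum is an expectation and $\forall j$ $\pi_S(S^+) \ge \pi_S(S^j)$. From this follows

$$C(T^g_S, \pi_S) \le c_i + 12 C^*(S) \ln \frac{\pi(S)}{\min(S)} - 12 C^*(S)\Bigl(1 - \sum_{j \in A} \pi_S(S^j)^2\Bigr)$$

$(1 - \sum_{j \in A} \pi_S(S^j)^2)$ is $\Delta_i(S, \pi_S)$, so by Corollary 2 and using $CP(\pi_S) \le 1/2$

$$\begin{aligned}
C(T^g_S, \pi_S) &\le c_i + 12 C^*(S) \ln \frac{\pi(S)}{\min(S)} - 12 C^*(S) c_i \frac{1 - CP(\pi_S)}{C^*(S)} \\
&= c_i + 12 C^*(S) \ln \frac{\pi(S)}{\min(S)} - 12 (1 - CP(\pi_S)) c_i \\
&\le 12 C^*(S) \ln \frac{\pi(S)}{\min(S)}
\end{aligned}$$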
which completes this case.
Case two: $CP(\pi_S) > 1/2$

The hypothesis with more than half the mass, $h_0$, lies at some depth $D$ in the greedy tree $T^g_S$. Counting the root of
$T^g_S$ as depth 0, $D \ge 1$. At depth $d > 0$, let $q_0, q_1, \ldots, q_{d-1}$ be the questions asked so far, $c_0, c_1, \ldots, c_{d-1}$ be the costs of
these questions, and $C_d = \sum_{i=0}^{d-1} c_i$ be the total cost incurred. At the root, $C_0 = 0$.

At depth $d < D$, we define $R_d$ to be the set of objects other than $h_0$ that are still in the version space along the
path to $h_0$. $R_0 = S \setminus \{h_0\}$ and for $d > 0$, $R_d = R_{d-1} \setminus \{h : q_{d-1}(h) \neq q_{d-1}(h_0)\}$. In other words, $R_d$ is $R_{d-1}$
with the objects that disagree with $h_0$ on $q_{d-1}$ removed. All of the objects in $R_d$ have the same response as $h_0$ for
$q_0, q_1, \ldots, q_{d-1}$. The right of Figure 2 shows this case.

We first bound the mass remaining in $R_d$ as a function of the label cost incurred so far. For $d > 0$, using Lemma 5

$$\pi(R_d) \le \pi(R_0) \prod_{i=0}^{d-1} \Bigl(1 - \frac{c_i}{2C^*(h_0)}\Bigr) \le \pi(R_0) e^{-C_d/(2C^*(h_0))}$$

Using this bound, we can bound $C_D$, the cost of identifying $h_0$ (i.e. $c_{T^g_S}(h_0)$). First note that $\pi(R_{D-1}) \ge \min(R_0)$
since at least one object is left in $R_{D-1}$. Combining this with the upper bound on the mass of $R_d$, we have if $D - 1 > 0$

$$C_{D-1} \le 2C^*(h_0) \ln(\pi(R_0)/\min(R_0))$$

This clearly also holds if $D - 1 = 0$, since $C_0 = 0$. We now only need to bound the cost of the final question (the
question asked at level $D - 1$). If the final question had cost greater than $2C^*(h_0)$, then by Lemma 5 this question
would reduce the mass of the set containing $h_0$ to less than $\pi(h_0)$. This is a contradiction, so the final question must
have cost no greater than $2C^*(h_0)$.

$$C_D \le 2C^*(h_0) \ln \frac{\pi(R_0)}{\min(R_0)} + 2C^*(h_0)$$

We use $A'_{d-1} = A \setminus q_{d-1}(h_0)$. Let $S_d^j$ be the set of objects $s$ removed from $R_{d-1}$ with the question at depth $d - 1$
such that $q_{d-1}(s) = j$, that is $R_{d-1} = R_d + \bigcup_{j \in A'_{d-1}} S_d^j$. Let $S_d = \bigcup_{j \in A'_{d-1}} S_d^j$. The right of Figure 2 illustrates
this notation. A useful variation of Lemma 1 we use in the following is that for $S = S^1 \cup S^2$ and $S^1 \cap S^2 = \emptyset$,
$\pi(S)C^*(S) = \pi(S^1)C^*(S^1) + \pi(S^2)C^*(S^2)$.
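We can write

$$\begin{aligned}
\pi(S) C(T^g_S, \pi_S) &\overset{(a)}{=} \pi(h_0) C_D + \sum_{d=1}^{D} \sum_{j \in A'_{d-1}} \pi(S_d^j)\bigl(C_d + C(T^g_{S_d^j}, \pi_{S_d^j})\bigr) \\
&\overset{(b)}{\le} \pi(h_0) C_D + \sum_{d=1}^{D} \pi(S_d) C_d + \sum_{d=1}^{D} \sum_{j \in A'_{d-1}} 12 \pi(S_d^j) C^*(S_d^j) \ln \frac{\pi(S_d^j)}{\min(S_d^j)} \\
&\overset{(c)}{\le} \pi(h_0) C_D + \pi(R_0) C_D + 12 \pi(R_0) C^*(R_0) \ln \frac{\pi(R_0)}{\min(R_0)} \\
&\overset{(d)}{\le} 2\pi(h_0) C_D + 12 \pi(R_0) C^*(R_0) \ln \frac{\pi(R_0)}{\min(R_0)}
\end{aligned}$$

Here a) decomposes the total cost into the cost of identifying $h_0$ and the cost of each branch leaving the path to $h_0$.
For each of these branches the total cost is the cost incurred so far plus the cost of the tree rooted at that branch, b)
uses the inductive hypothesis, c) uses that the sets $S_d^j$ are pairwise disjoint and $\bigcup_d S_d = R_0$, and d) uses $\pi(R_0) < \pi(h_0)$. Continuing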



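$$\begin{aligned}
\pi(S) C(T^g_S, \pi_S) &\overset{(a)}{\le} 4\pi(h_0) C^*(h_0)\Bigl(\ln \frac{\pi(R_0)}{\min(R_0)} + 1\Bigr) + 12 \pi(R_0) C^*(R_0) \ln \frac{\pi(R_0)}{\min(R_0)} \\
&\overset{(b)}{\le} 4\pi(h_0) C^*(h_0)\Bigl(\ln \frac{\pi(S)}{\min(S)} + 1\Bigr) + 12 \pi(R_0) C^*(R_0) \ln \frac{\pi(S)}{\min(S)}
\end{aligned}$$

where a) uses our bound on $C_D$ and b) uses $R_0 \subseteq S$. Finally

$$\begin{aligned}
\pi(S) C(T^g_S, \pi_S) &\le 12 \pi(h_0) C^*(h_0) \ln \frac{\pi(S)}{\min(S)} + 12 \pi(R_0) C^*(R_0) \ln \frac{\pi(S)}{\min(S)} \\
&= \pi(S) 12 C^*(S) \ln \frac{\pi(S)}{\min(S)}
\end{aligned}$$

where we use $\pi(S) \ge 2\min(S)$ and therefore $\ln \frac{\pi(S)}{\min(S)} \ge \ln 2 > .5$. Dividing both sides by $\pi(S)$ gives the desired
result. □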
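The role of the constant 12 in the last step of Case two can be spelled out. The short derivation below uses the shorthand $x = \ln(\pi(S)/\min(S))$, a symbol introduced here only for illustration, and relies on the fact $x \ge \ln 2 > 1/2$ stated in the proof.

```latex
\begin{align*}
x > \tfrac{1}{2}
  &\;\Longrightarrow\; 1 < 2x \\
  &\;\Longrightarrow\; 4(x + 1) < 4x + 8x = 12x \\
  &\;\Longrightarrow\; 4\,\pi(h_0)\,C^*(h_0)\,(x + 1) \le 12\,\pi(h_0)\,C^*(h_0)\,x .
\end{align*}
% Adding 12 \pi(R_0) C^*(R_0) x to both sides and applying the variation of
% Lemma 1 with S = \{h_0\} \cup R_0 gives
\begin{align*}
12\,x\,\bigl(\pi(h_0)\,C^*(h_0) + \pi(R_0)\,C^*(R_0)\bigr) = 12\,x\,\pi(S)\,C^*(S).
\end{align*}
```

The factor 4 comes from the bound $C_D \le 2C^*(h_0)(x + 1)$ multiplied by the $2\pi(h_0)$ in step d), so 12 is the smallest multiple of 4 for which the additive $+1$ can be absorbed using only $x > 1/2$.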

4 Distribution Independent Bound 

We now show the dependence on $\pi$ can be removed using a variation of the rounding trick used by Kosaraju et al.
[11] and Chakaravarthy et al. [3]. The intuition behind this trick is that we can round up small values of $\pi$ to obtain
a distribution $\pi'$ in which $\ln(1/\min_{h \in H} \pi'(h)) = O(\ln n)$ while ensuring that for any tree $T$, $C(T, \pi)/C(T, \pi')$
is bounded above and below by a constant. Here $n = |H|$. When the greedy algorithm is applied to this rounded
distribution, the resulting tree gives an $O(\log n)$ approximation to the optimal tree for the original distribution. In our
cost sensitive setting, the intuition remains the same, but the introduction of costs changes the result.

Let $c_{max} = \max_i c_i$ and $c_{min} = \min_i c_i$. In this discussion, we consider irreducible query trees, which we define
to be query trees which contain only questions with non-zero shrinkage. Greedy query trees will always have this
property as will optimal query trees. This property lets us assume any path from the root to a leaf has at most $n$ nodes
with cost at most $c_{max} n$ because at least one hypothesis is eliminated by each question. Define $\pi'$ to be the distribution
obtained from $\pi$ by adding $c_{min}/(c_{max} n^3)$ mass to any hypothesis $h$ for which $\pi(h) < c_{min}/(c_{max} n^3)$. Subtract the
corresponding mass from a single hypothesis $h_j$ for which $\pi(h_j) \ge 1/n$ (there must be at least one such hypothesis). By
construction, we have that $\min_i \pi'(h_i) \ge c_{min}/(c_{max} n^3)$. We can also bound the amount by which the cost of a tree
changes as a result of rounding

Lemma 6. For any irreducible query tree $T$ and $\pi$,

$$\tfrac{1}{2} C(T, \pi) \le C(T, \pi') \le \tfrac{3}{2} C(T, \pi)$$



Proof. For the first inequality, let $h'$ be the hypothesis we subtract mass from when rounding. The cost to identify $h'$,
$c_T(h')$, is at most $c_{max} n$. Since we subtract at most $c_{min}/(c_{max} n^2)$ mass and $c_T(h') \le c_{max} n$, we then have

$$C(T, \pi') \ge C(T, \pi) - \frac{c_{min}}{c_{max} n^2} c_T(h') \ge C(T, \pi) - \frac{c_{min}}{n} \ge \tfrac{1}{2} C(T, \pi)$$

The last step uses $C(T, \pi) \ge c_{min}$ and $n \ge 2$. For the second inequality, we add at most $c_{min}/(c_{max} n^3)$ mass to
each hypothesis and $\sum_h c_T(h) \le c_{max} n^2$, so

$$C(T, \pi') \le C(T, \pi) + \sum_{h \in H} \frac{c_{min}}{c_{max} n^3} c_T(h) \le C(T, \pi) + \frac{c_{min}}{n} \le \tfrac{3}{2} C(T, \pi)$$

The last step again uses $C(T, \pi) \ge c_{min}$ and $n \ge 2$. □
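The rounding step can be sketched as follows, assuming the threshold $c_{min}/(c_{max} n^3)$ implied by the constants in the proof of Theorem 4. The function name and the three-object example are hypothetical, and taking the mass from the heaviest hypothesis is one valid choice of donor since its mass is always at least $1/n$.

```python
def round_distribution(pi, c_min, c_max):
    """Return pi' in which every mass below c_min / (c_max * n**3) is raised by that amount."""
    n = len(pi)
    bump = c_min / (c_max * n ** 3)
    rounded = dict(pi)
    added = 0.0
    for h, p in pi.items():
        if p < bump:
            rounded[h] = p + bump
            added += bump
    # Take the added mass from a single hypothesis with pi(h) >= 1/n;
    # the heaviest hypothesis always qualifies.
    donor = max(pi, key=pi.get)
    rounded[donor] -= added
    return rounded


# Hypothetical example with n = 3, c_min = 1, c_max = 2, so the threshold is 1/54.
pi = {"a": 0.9989, "b": 0.001, "c": 0.0001}
pi_rounded = round_distribution(pi, c_min=1.0, c_max=2.0)
print(pi_rounded)
print(min(pi_rounded.values()) >= 1.0 / (2.0 * 3 ** 3))  # True
```

After rounding, $\ln(1/\min_h \pi'(h))$ is at most $\ln(c_{max} n^3/c_{min})$ no matter how small the original masses were, which is what replaces the dependence on $\pi$ with a dependence on $n$ and the cost ratio.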

We can finally give a bound on the greedy algorithm applied to $\pi'$, in terms of $n$ and $c_{max}/c_{min}$.
Theorem 4. For any $\pi$ the greedy query tree $T^g$ for $\pi'$ has cost at most

$$C(T^g, \pi) \le O\Bigl(C^* \ln\Bigl(n \frac{c_{max}}{c_{min}}\Bigr)\Bigr)$$

where $C^* = \min_T C(T, \pi)$.

Proof. Let $T'$ be an optimal tree for $\pi'$ and $T^*$ be an optimal tree for $\pi$. Using Theorem 2, $\min_i \pi'(h_i) \ge c_{min}/(c_{max} n^3)$,
and Lemma 6,

$$\begin{aligned}
C(T^g, \pi) \le 2C(T^g, \pi') &\le 72 C(T', \pi') \ln\Bigl(n \frac{c_{max}}{c_{min}}\Bigr) \\
&\le 72 C(T^*, \pi') \ln\Bigl(n \frac{c_{max}}{c_{min}}\Bigr) \le 108 C(T^*, \pi) \ln\Bigl(n \frac{c_{max}}{c_{min}}\Bigr)
\end{aligned}$$

□

Algorithm 2 $\epsilon$-Approximate Cost Sensitive Greedy Algorithm

1: $S \Leftarrow H$
2: repeat
3:   Find $i$ so $\Delta_i(S, \pi_S)/c_i \ge (1 - \epsilon) \max_j \Delta_j(S, \pi_S)/c_j$
4:   $S \Leftarrow \{s \in S : q_i(s) = q_i(h^*)\}$
5: until $|S| = 1$



5 ε-Approximate Algorithm

Some of the non traditional active learning scenarios involve a large number of possible questions. For example, in the 
batch active learning scenario we describe, there may be a question corresponding to every subset of single data point 
questions. In these scenarios, it may not be possible to exactly find the question with largest shrinkage-cost ratio. It is 
not hard to extend our analysis to a strategy that at each step finds a question $q_i$ with

$$\Delta_i(S, \pi_S)/c_i \ge (1 - \epsilon) \max_j \Delta_j(S, \pi_S)/c_j$$

for $\epsilon \in [0, 1)$. We call this the $\epsilon$-approximate cost sensitive greedy algorithm. Algorithm 2 outlines this strategy. We
show $\epsilon > 0$ only introduces a $1/(1 - \epsilon)$ factor into the bound. Kosaraju et al. [11] report a similar extension to their
result.
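The acceptance condition of the ε-approximate algorithm can be illustrated with a small sketch. The function name and the list of ratios are hypothetical. The sketch only checks which questions satisfy the condition given all ratios, and says nothing about how a suitable question would be found efficiently when the question set is very large.

```python
def acceptable_questions(ratios, eps):
    """Indices i whose shrinkage-cost ratio is at least (1 - eps) times the best ratio."""
    if not 0 <= eps < 1:
        raise ValueError("eps must lie in [0, 1)")
    best = max(ratios)
    return [i for i, r in enumerate(ratios) if r >= (1 - eps) * best]


# Hypothetical shrinkage-cost ratios for four questions at some version space.
ratios = [0.50, 0.48, 0.30, 0.46]
print(acceptable_questions(ratios, 0.0))   # [0]
print(acceptable_questions(ratios, 0.05))  # [0, 1]
print(acceptable_questions(ratios, 0.10))  # [0, 1, 3]
```

With $\epsilon = 0$ only an exact maximizer is acceptable, which recovers the cost sensitive greedy algorithm; larger $\epsilon$ admits more questions at the price of the $1/(1 - \epsilon)$ factor in the bound.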

Theorem 5. For any $\pi$ the $\epsilon$-approximate greedy query tree $T$ has cost at most

$$C(T, \pi) \le \frac{12}{1 - \epsilon} C^* \ln \frac{1}{\min_h \pi(h)}$$

where $C^* = \min_T C(T, \pi)$.

This theorem follows from extensions of Corollary 2, Lemma 5, and Theorem 3. The proofs are straightforward,
but we outline them below for completeness. It is also straightforward to derive a similar extension of Theorem 4.
The following corollary follows directly from Corollary 2 and the $\epsilon$-approximate algorithm.

Corollary 3. For any $S \subseteq H$ and query tree $T$ whose leaves contain $S$, the question $q_i$ chosen by an $\epsilon$-approximate
query tree has $\Delta_i(S, \pi_S)/c_i \ge (1 - \epsilon)(1 - CP(\pi_S))/C(T, \pi_S)$

The following lemma extends Lemma 5 to the approximate case.

Lemma 7. Consider any $\pi$ and $S \subseteq H$ with $CP(\pi_S) > 1/2$ and a corresponding subtree $T^\epsilon_S$ in an $\epsilon$-approximate
greedy tree. Let $C^*(h_0) = C(T^*, \pi_{\{h_0\}})$ for any $T^*$. The question $q_i$ chosen by $T^\epsilon_S$ has $\delta_i/c_i \ge (1 - \epsilon)/(2C^*(h_0))$.

Proof. The proof follows that of Lemma 5. We show the fraction of $R$ removed per unit cost by the $\epsilon$-approximate
greedy tree is at least $(1 - \epsilon)/2$ that of any other question. Using Lemma 4 the result then follows. Let $q_i$ be any
question and $q_j$ be the question chosen by an $\epsilon$-approximate greedy tree. $\Delta_j(S, \pi_S)/c_j \ge (1 - \epsilon)\Delta_i(S, \pi_S)/c_i$.
Using upper and lower bounds from Lemma 5, we then know $2\delta_j \pi_S(R)/c_j \ge (1 - \epsilon)\delta_i \pi_S(R)/c_i$ and can conclude
$2\delta_j/(c_j(1 - \epsilon)) \ge \delta_i/c_i$. The lemma then follows from Lemma 4. □

Theorem 6. If $T^*$ is any query tree for $\pi$ and $T^\epsilon$ is an $\epsilon$-approximate greedy query tree for $\pi$, then for any $S \subseteq H$
corresponding to the subtree $T^\epsilon_S$ of $T^\epsilon$,

$$C(T^\epsilon_S, \pi_S) \le \frac{12}{1 - \epsilon} C(T^*, \pi_S) \ln \frac{\pi(S)}{\min_{h \in S} \pi(h)}$$
|                          | $k > 2$ | Non uniform $c_i$ | Non uniform $\pi$ | Result                          |
| Kosaraju et al. [11]     | Y       | N                 | Y                 | $O(\log n)$                     |
| Dasgupta [6]             | N       | N                 | Y                 | $O(\log(1/\min_h \pi(h)))$      |
| Adler and Heeringa [1]   | N       | Y                 | N                 | $O(\log n)$                     |
| Chakaravarthy et al. [3] | Y       | N                 | Y                 | $O(\log k \log n)$              |
| Chakaravarthy et al. [4] | Y       | N                 | N                 | $O(\log n)$                     |
| This paper               | Y       | Y                 | Y                 | $O(\log(1/\min_h \pi(h)))$      |
| This paper               | Y       | Y                 | Y                 | $O(\log(n \max_i c_i/\min_i c_i))$ |

Table 1: Summary of approximation ratios achieved by related work. Here $n$ is the number of objects, $k$ is the number
of possible responses, $c_i$ are the question costs, and $\pi$ is the distribution over objects.

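Proof. The proof follows very closely that of Theorem 3 and we use the same notation. We again use induction on
$|S|$, and the base case holds trivially.
Case one: $CP(\pi_S) \le 1/2$

Using the inductive hypothesis and the same steps as in Theorem 3 one can show

$$C(T^\epsilon_S, \pi_S) \le c_i + \frac{12}{1 - \epsilon} C^*(S) \ln \frac{\pi(S)}{\min(S)} - \frac{12}{1 - \epsilon} C^*(S)\Bigl(1 - \sum_{j \in A} \pi_S(S^j)^2\Bigr)$$

$(1 - \sum_{j \in A} \pi_S(S^j)^2)$ is $\Delta_i(S, \pi_S)$, so using Corollary 3 and $CP(\pi_S) \le 1/2$

$$\begin{aligned}
C(T^\epsilon_S, \pi_S) &\le c_i + \frac{12}{1 - \epsilon} C^*(S) \ln \frac{\pi(S)}{\min(S)} - \frac{12}{1 - \epsilon} C^*(S) c_i \frac{(1 - \epsilon)(1 - CP(\pi_S))}{C^*(S)} \\
&= c_i + \frac{12}{1 - \epsilon} C^*(S) \ln \frac{\pi(S)}{\min(S)} - 12 (1 - CP(\pi_S)) c_i \\
&\le \frac{12}{1 - \epsilon} C^*(S) \ln \frac{\pi(S)}{\min(S)}
\end{aligned}$$

which completes this case.
Case two: $CP(\pi_S) > 1/2$

Using Lemma 7 and the same steps and notation as in Theorem 3,

$$\pi(R_d) \le \pi(R_0) e^{-(1 - \epsilon) C_d/(2C^*(h_0))}$$

Using this bound, we can again bound $C_D$, the cost of identifying $h_0$.

$$C_D \le \frac{2}{1 - \epsilon} C^*(h_0) \ln \frac{\pi(R_0)}{\min(R_0)} + \frac{2}{1 - \epsilon} C^*(h_0)$$

The remainder of the case follows the same steps as Theorem 3. □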



6 Related Work 

Table 1 summarizes previous results analyzing greedy approaches to this problem. A number of these results were
derived independently in different contexts. Our work gives the first approximation result for the general setting 
in which there are more than two possible responses to questions, non uniform question costs, and a non uniform 
distribution over objects. We give bounds for two algorithms, one with performance independent of the query costs 
and one with performance independent of the distribution over objects. Together these two bounds match all previous 
bounds for less general settings. We also note that Kosaraju et al. [11] only mention an extension to non binary queries 
(Remark 1), and our work is the first to give a full proof of an O(log n) bound for the case of non binary queries and 
non uniform distributions over objects. 
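
To see how the two bounds in Table 1 relate to the O(log n) results for less general settings, it may help to specialize them. The following is a sketch of the arithmetic only, using the symbols n, π, and c_i from the table caption; the constant c is an illustrative name for a common query cost:

```latex
\begin{align*}
\pi(h) = \tfrac{1}{n}\ \text{for all } h
  &\;\Longrightarrow\; \log\frac{1}{\min_h \pi(h)} = \log n , \\
c_i = c\ \text{for all } i
  &\;\Longrightarrow\; \log\Big( n\,\frac{\max_i c_i}{\min_i c_i} \Big) = \log n .
\end{align*}
```

With a uniform distribution the first bound reduces to O(log n), and with uniform costs the second bound does.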
Our work and the work we extend are examples of exact active learning. We seek to exactly identify a target 
hypothesis from a finite set using a sequence of queries. Other work considers active learning where it suffices to 
identify with high probability a hypothesis close to the target hypothesis [7, 2]. The exact and approximate problems 
can sometimes be related [10]. 

Most theoretical work in active learning assumes unit costs and simple label queries. An exception is Hanneke 
[9], who also considers a general learning framework in which queries are arbitrary and have known costs associated with 
them. In fact, the setting used by Hanneke is more general in that questions are allowed to have more than one valid 
answer for each hypothesis. Hanneke gives worst-case upper and lower bounds in terms of a quantity called the 
General Identification Cost and related quantities. There are interesting parallels between our average-case analysis 
and this worst-case result. 

Practical work incorporating costs in active learning [12, 8] has also considered methods that maximize a benefit-cost 
ratio similar in spirit to the method used here. However, Settles et al. [12] suggest this strategy may not be 
sufficient for practical cost savings. 



7 Implications 

We briefly discuss the implications of our result in terms of the motivating applications. 

For the active learning applications, our result shows that the cost-sensitive greedy algorithm approximately 
minimizes cost compared to any other deterministic strategy using the same set of queries. For the batch learning 
setting, if we create a question corresponding to each subset of the dataset, then the resulting greedy strategy does 
approximately as well as any other algorithm that makes a sequence of batch label queries. This result holds no matter 
how we assign costs to different queries although restrictions may need to be made in order to ensure computing the 
greedy strategy is feasible. Similarly, for the partial label query setting, the greedy strategy is approximately optimal 
compared to any other active learning algorithm using the same set of partial label queries. 
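
The cost-sensitive greedy strategy discussed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the greedy rule picks the query maximizing Δ_i(S, π_S) / c_i with Δ_i(S, π_S) = 1 - Σ_j π_S(S^j)^2, and the function names, the dictionary layout of `answers`, and the toy hypotheses and queries are all hypothetical.

```python
from typing import Dict, Hashable, List

# answers[q][h] is the single response hypothesis h gives to query q
# (an illustrative data layout, not taken from the paper).
Answers = Dict[str, Dict[str, Hashable]]


def split(S: List[str], answers_q: Dict[str, Hashable]) -> Dict[Hashable, List[str]]:
    """Partition the version space S by the response each hypothesis gives."""
    parts: Dict[Hashable, List[str]] = {}
    for h in S:
        parts.setdefault(answers_q[h], []).append(h)
    return parts


def delta(S: List[str], prior: Dict[str, float], answers_q: Dict[str, Hashable]) -> float:
    """1 - sum_j pi_S(S^j)^2, where pi_S is the prior renormalized on S."""
    total = sum(prior[h] for h in S)
    return 1.0 - sum(
        (sum(prior[h] for h in part) / total) ** 2
        for part in split(S, answers_q).values()
    )


def greedy_expected_cost(S: List[str], prior: Dict[str, float],
                         answers: Answers, costs: Dict[str, float]) -> float:
    """Expected cost, under pi_S, of the tree built by maximizing delta / cost."""
    if len(S) <= 1:
        return 0.0  # a single remaining hypothesis is already identified
    best = max(answers, key=lambda q: delta(S, prior, answers[q]) / costs[q])
    if delta(S, prior, answers[best]) <= 0.0:
        raise ValueError("remaining hypotheses cannot be distinguished")
    total = sum(prior[h] for h in S)
    cost = costs[best]
    for part in split(S, answers[best]).values():
        mass = sum(prior[h] for h in part) / total
        cost += mass * greedy_expected_cost(part, prior, answers, costs)
    return cost


# Toy instance: two binary queries plus one query that names the object outright.
hyps = ["a", "b", "c", "d"]
prior = {h: 0.25 for h in hyps}
answers: Answers = {
    "q1": {"a": 0, "b": 0, "c": 1, "d": 1},
    "q2": {"a": 0, "b": 1, "c": 0, "d": 1},
    "enumerate": {h: h for h in hyps},
}
cheap = {"q1": 1.0, "q2": 1.0, "enumerate": 1.0}
costly = {"q1": 1.0, "q2": 1.0, "enumerate": 3.0}
print(greedy_expected_cost(hyps, prior, answers, cheap))
print(greedy_expected_cost(hyps, prior, answers, costly))
```

When the enumerating query is as cheap as a binary one, the greedy rule asks it immediately; when it costs three times as much, the rule falls back to the two binary queries. A batch or partial label query would enter this sketch as one more entry in `answers` with its own cost.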

In the information retrieval domain, our result shows that when the cost of a question is set to be the computational 
cost of determining which branch an object is in, the resulting greedy query tree is approximately optimal with respect 
to expected search time. Although the result only holds for expected search time and for searches for objects in the 
tree (i.e. point location queries), the result is very general. In particular, it makes no restriction on the type of splits 
(i.e. questions) used in the tree, and the result therefore applies to many kinds of search trees. In this application, our 
result specifically improves previous results by allowing for arbitrary mixing of different kinds of splits through the 
use of costs. 

Finally, in the compression domain, our result gives a bound on expected code length for top-down greedy 
code construction. Top-down greedy code construction is known to be suboptimal, but our result shows it is 
approximately optimal and generalizes previous bounds. 
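
As an illustration of the compression setting, the sketch below compares the expected code length of a top-down greedy code with that of a Huffman code. It is a simplified example under stated assumptions: every binary split of the symbols is available at unit cost, so the greedy rule reduces to taking the most balanced split; splits are enumerated by brute force, which is only feasible for small alphabets; and the function names and the example distribution are hypothetical.

```python
import heapq
import itertools
from typing import Dict, List


def greedy_code_length(symbols: List[str], prob: Dict[str, float]) -> float:
    """Expected depth of the top-down tree that always takes the most balanced split."""
    if len(symbols) <= 1:
        return 0.0
    total = sum(prob[s] for s in symbols)
    best_left: List[str] = []
    best_gap = float("inf")
    rest = symbols[1:]  # keep symbols[0] on the left to skip mirrored splits
    for r in range(len(rest)):  # r < len(rest) keeps the right side non-empty
        for extra in itertools.combinations(rest, r):
            left = [symbols[0], *extra]
            gap = abs(sum(prob[s] for s in left) / total - 0.5)
            if gap < best_gap:
                best_left, best_gap = left, gap
    right = [s for s in symbols if s not in best_left]
    mass_left = sum(prob[s] for s in best_left) / total
    return (1.0
            + mass_left * greedy_code_length(best_left, prob)
            + (1.0 - mass_left) * greedy_code_length(right, prob))


def huffman_code_length(prob: Dict[str, float]) -> float:
    """Expected length of a Huffman code, assuming the probabilities sum to one."""
    heap = list(prob.values())
    heapq.heapify(heap)
    length = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        length += merged  # each merge adds one bit to every symbol beneath it
        heapq.heappush(heap, merged)
    return length


# Hypothetical source distribution, chosen only for illustration.
prob = {"A": 0.35, "B": 0.17, "C": 0.17, "D": 0.16, "E": 0.15}
print(greedy_code_length(list(prob), prob), huffman_code_length(prob))
```

Because a Huffman code is optimal among prefix codes, the greedy length can never be smaller than the Huffman length; the approximation result bounds how much larger it can be.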



8 Open Problems 

Chakaravarthy et al. [3] show it is NP-hard to approximate the optimal query tree within a factor of Ω(log n) for 
binary queries and non uniform π. This hardness result is with respect to the number of objects. Some open questions 
remain. For the more general setting with non uniform query costs, is there an algorithm with an approximation 
ratio independent of both π and c_i? The simple rounding technique we use seems to require dependence on c_i, but a 
more advanced method could avoid this dependence. Also, can the Ω(log n) hardness result be extended to the more 
restrictive case of uniform π? It would also be interesting to extend our analysis to allow for questions to have more 
than one valid answer for each hypothesis. This would allow queries which ask for a positively labeled example from 
a set of examples. Such an extension appears non trivial, as a straightforward extension assuming the given answer is 
randomly chosen from the set of valid answers produces a tree in which the mass of hypotheses is split across multiple 
branches, affecting the approximation. 

Much work also remains in the analysis of other active learning settings with general queries and costs. Of particular 
practical interest are extensions to agnostic algorithms that converge to the correct hypothesis under no assumptions 
[7, 2]. Extensions to treat label costs, partial label queries, and batch mode active learning are all of interest, and these 
learning algorithms could potentially be extended to treat these three sub problems at once using a similar setting. 

For some of these algorithms, even without modification we can guarantee the method does no worse than passive 
learning with respect to label cost. In particular, Dasgupta et al. [7] and Beygelzimer et al. [2] both give algorithms 
that iterate through T examples, at each step requesting a label with probability p_t. These algorithms are shown to not 
do much worse (in terms of generalization error) than the passive algorithm which requests every label. Because the 
algorithm queries for labels for a subset of T i.i.d. examples, the label cost of the algorithm is also no worse than 
the passive algorithm requesting T random labels. It remains an open problem however to show these algorithms can 
do better than passive learning in terms of label cost (most likely this will require modifications to the algorithm or 
additional assumptions). 